LWN: Comments on "Fedora and Python 2"

Fedora and Python 2

RooTer — Sat, 28 Apr 2018 20:22:59 +0000

> What if you need some Python 3 features but without the Python3 encode/decode string-hell?

Having developed python apps in both Python 2 and 3 for years, I would say the encode/decode hell exists in Python 2 realm, not 3.
It seems that stupid `UnicodeDecodeError`s plague almost every Python 2 project, and a switch to Python 3 would be a good idea just for the clear str/bytes distinction.
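
To make the str/bytes distinction concrete, here is a minimal Python 3 sketch (the variable names are made up for illustration). Mixing the two types fails immediately with a TypeError, whatever the data contains, whereas Python 2 implicitly decoded as ASCII when unicode and str objects were mixed and only failed once non-ASCII data showed up.

```python
# Bytes as they might arrive from a socket or a file opened in binary mode.
payload = b"caf\xc3\xa9"

# Python 3 refuses to mix str and bytes, even for pure-ASCII data.
try:
    "menu: " + payload
    mixed_ok = True
except TypeError:
    mixed_ok = False

# The conversion has to be spelled out, with a named encoding.
label = "menu: " + payload.decode("utf-8")
```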

Fedora and Python 2

flussence — Sat, 28 Apr 2018 20:07:47 +0000

> Can you think of any other situation in free software where a team that has abandoned a codebase has not just discouraged someone willing from taking over maintenance, but in fact used legal pressure to prevent it?

A few years back the libav gang attempted to sue the legitimate FFmpeg project out of existence over its use of the logo. They failed because they didn't actually own any rights to the image to begin with; a third party contributed it. Hasn't stopped them using it, mind you.

This Python naming dispute isn't an act of malice - it's a simple trademark defence. Mozilla does exactly the same thing, which is why we have IceCat, IceWeasel, PaleMoon, Seamonkey etc.

Fedora and Python 2

smurf — Tue, 17 Apr 2018 20:22:52 +0000

What do you mean, "abandoned"? It's not. The Python 2 branch contains a whole lot more changes (looking at the last half year) than Tauthon.

Fedora and Python 2

rahulsundaram — Tue, 17 Apr 2018 14:00:47 +0000

> Can you think of any other situation in free software where a team that has abandoned a codebase has not just discouraged someone willing from taking over maintenance

That's not what happened here. Maintenance of the codebase is perfectly fine. The name however is not free to use for forks. That is a situation that is quite common in Free software projects.

Fedora and Python 2

bandrami — Tue, 17 Apr 2018 13:27:40 +0000

You're leaving out a HUGE part there. Someone took over development of Python 2 *after it had been abandoned by its original team* and was threatened with legal action for keeping the name if he maintained that language.

Can you think of any other situation in free software where a team that has abandoned a codebase has not just discouraged someone willing from taking over maintenance, but in fact used legal pressure to prevent it?

Fedora and Python 2

dvdeug — Sun, 15 Apr 2018 20:53:33 +0000

It adds time linear in the amount of text being processed. Since it only needs to touch text being processed, and processing it already takes at least time linear in the amount of text being processed, it does not change the Big O of your operation at all. It scales by definition. I suspect even in Python it will always be trivial in the amount of time it takes, but the issue is certainly not whether it scales.

We're not talking about human readable symbols; we're talking about "non-ASCII symbols in your program" that aren't Unicode. Even if the editor mangles it for you, how is bsymFFEAA9 worse than \xff\xea\xa9? Something slightly smarter than base64 would preserve human readable names and only mangle unreadable names, but the only case where not worrying about mangling is going to cause problems in Python 3 is when it's not human readable.
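
The "slightly smarter than base64" idea can be sketched in a few lines. This is only an illustration under stated assumptions: `mangle`, `demangle`, and the `bsym` prefix are hypothetical names modelled on the example above, not an existing API. Names that are already valid ASCII identifiers pass through unchanged; everything else is hex-encoded, which costs one linear pass over the name.

```python
import keyword

PREFIX = "bsym"  # hypothetical marker for mangled names


def mangle(raw: bytes) -> str:
    """Return a Python-safe identifier for an arbitrary byte-string symbol."""
    try:
        name = raw.decode("ascii")
    except UnicodeDecodeError:
        name = None
    readable = (
        name is not None
        and name.isidentifier()
        and not keyword.iskeyword(name)
        and not name.startswith(PREFIX)  # avoid clashing with mangled names
    )
    if readable:
        return name
    return PREFIX + raw.hex().upper()


def demangle(name: str) -> bytes:
    """Recover the original bytes from a name produced by mangle()."""
    if name.startswith(PREFIX):
        return bytes.fromhex(name[len(PREFIX):])
    return name.encode("ascii")
```

Readable symbols stay readable for scripts, tests, and debugging; only the names that were line noise to begin with get mangled.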

Fedora and Python 2

togga — Sun, 15 Apr 2018 15:13:10 +0000

> "then encoding them using base64 is trivial (and again, you don't care about the Python symbol names so why do you care if they're line noise or encoded line noise?)"

1. Doesn't scale. Changing representation requires one additional pass over the data. Python is already slow to begin with.
2. Accessing human readable symbols is convenient when needed by scripts, tests or debug.

Han Unification means UCS-2 was doomed and UTF-16 makes no sense

DHR — Sun, 15 Apr 2018 01:14:34 +0000

When UNICODE-1 (ucs-2) was designed, with a maximum of 64k code points, it was 100% foreseeable that this was a mistake.

The key was that it required "Han unification". Japanese, traditional Chinese, Korean, and simplified Chinese symbols would have to share code points. This was never going to be acceptable to people using those languages.

The analogy I heard was: would the Greek and Roman alphabets share code points? Alpha and A are really the same, are they not? How about Aleph? No way!

The fact is that UNICODE-1 was doomed before birth.

UTF-16 was always a bad idea. Some tried to ignore that and we live with that mistake.

<https://en.wikipedia.org/wiki/Han_unification>

UTF-8 was designed by the Plan9 folks. Quite early. On a napkin. Some of them later brought us Go.

Fedora and Python 2

SiB — Sat, 14 Apr 2018 16:48:38 +0000

Exactly!

In our department (physics) we use python for data analysis and for instrumentation control (including space flight). Python 3 is perfectly fine for the data analysis. Instrumentation control uses the python repl as commanding interface, where python 2 is still ahead.

Fedora and Python 2

excors — Sat, 14 Apr 2018 15:15:01 +0000

Perhaps the issue is that some people (particularly people on LWN) are more interested in systems programming, and their programs typically deal with protocols and file formats and APIs that are primarily byte-based and occasionally contain human-readable text, whereas other people are more interested in e.g. web programming where their data is primarily human-readable inputs and outputs and modern Unicode-based file formats (HTML, CSS, etc).

People in the first category might understand Unicode perfectly well, but they often need to deal with e.g. filenames (which aren't really Unicode on Linux or Windows), or with e.g. HTTP headers (where the encoding is unclearly specified and real data often violates the specification anyway), and they want a language that makes it easy and natural to process data like that. Python 3 makes it less easy and less natural than Python 2, since the language and the libraries tend to default to Unicode strings, so those people are unhappy. Meanwhile people in the second category prefer having everything be Unicode by default, since that's all they use anyway. Neither side is wrong or ignorant, they just have different use cases and different requirements, and Python failed to find a way to satisfy both groups.

Fedora and Python 2

peniblec — Sat, 14 Apr 2018 13:38:25 +0000

Fair enough. I’ve mostly only worked in Python 3 codebases, and the only place where I hear people debate the str-vs-bytes business is on LWN.

That restricts my sample of arguments against Python 3 to the high-level design issues I mentioned; I have not been “in the trenches” migrating sloppy code to Python 3. In my imagination the “characters are bytes” camp (and their code) had been dissolved during the noughties; I guess that was wishful thinking :)

Fedora and Python 2

peniblec — Sat, 14 Apr 2018 11:48:39 +0000

OK. Let’s say Python’s string type uses normalization/grapheme clusters/nanomachines to correctly compare sequences of Unicode characters. Would that necessarily make a text editor overzealously normalize your whole file, thus polluting your patch?

I don’t know how actual text editors do it, but I imagine that their representation of your file’s content is more nuanced than simply “whatever open(filename) returned”. I would assume that they represent a “file” as sequences of opaque “word” or “line” objects, each of those objects having methods to:

1. get their position in the file’s byte-stream (start and end offset, cached once decoded), so that the editor knows where to apply changes;
2. get their “canonical” Unicode representation, so that the editor can do whatever an editor is supposed to do with meat-space characters (comparison for search-and-replace, length computation for line-wrapping).

So with such a design, I don’t think “Python’s str canonicalizing behind your back” would necessarily lead to “OMG this commit is full of extraneous crap introduced by this dumb Python text editor”. Again, I might not have thought enough about this, maybe the above does nothing to solve the problem.

(Congratulations, you’ve nerd-sniped me into designing a text editor ;) )

Alternative workaround: teach our diffing tools to normalize text before computing differences :D

They do already let us skip whitespace changes, for example, which is a subclass of the more general category of “things computers care about despite being mostly irrelevant to meatbags”.
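
A rough sketch of that workaround, using only the standard library: normalize both sides before handing them to difflib. The helper name `normalized_diff` is made up for this example, and the choice of NFC is an assumption; any normalization form would do as long as both sides use the same one.

```python
import difflib
import unicodedata


def normalized_diff(old_lines, new_lines, form="NFC"):
    """Unified diff that ignores differences in Unicode normalization."""
    old = [unicodedata.normalize(form, line) for line in old_lines]
    new = [unicodedata.normalize(form, line) for line in new_lines]
    return list(difflib.unified_diff(old, new, lineterm=""))


# Same text for a human reader, different code points on disk.
before = ["caf\u00e9", "tea"]    # precomposed e-acute
after = ["cafe\u0301", "tea"]    # 'e' followed by a combining accent

plain = list(difflib.unified_diff(before, after, lineterm=""))
quiet = normalized_diff(before, after)
```

Here `plain` reports a changed line while `quiet` is empty, much like a whitespace-insensitive diff hides reindentation.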

Fedora and Python 2

dvdeug — Sat, 14 Apr 2018 06:26:12 +0000

If Python normalized text by default, the text editor would have a hard time doing that.

Fedora and Python 2

sfeam — Fri, 13 Apr 2018 21:37:17 +0000

Well that's a bug then, isn't it.

Fedora and Python 2

jwilk — Fri, 13 Apr 2018 19:38:27 +0000

>>> import unicodedata
>>> unicodedata.normalize('NFD', u'\u0387') == u'\xB7'
True

Fedora and Python 2

sfeam — Fri, 13 Apr 2018 17:48:40 +0000

U+0387 doesn't "decompose" into anything. It's not a combining form. It is an example of a character in one alphabet whose common written form happens to look like a character from some other alphabet or set of conventional symbols. Because they look similar [in typical fonts] people tend to type whichever is more convenient. But neither one is the "canonical form" of the other. A more familiar pair would be Greek letter "mu" (U+03BC) and the scientific prefix "micro" (U+00B5). The existence of such pairs can be a problem, but it's a different problem than canonicalization. While it might make sense to be suspicious of micro signs appearing in what is otherwise a Greek alphabet URL, it would be a bad idea to replace all micro signs with "mu" (or vice versa) in a document that happened to include both Greek text and quantities in SI units.

Fedora and Python 2

ceplm — Fri, 13 Apr 2018 16:39:21 +0000

The standard reply to every "foo is known to be buggy" is "And what's the bug number?" Also, I would ask the author of the comment why the bug cannot be fixed. Doesn't make sense to me.

Fedora and Python 2

HelloWorld — Fri, 13 Apr 2018 12:44:33 +0000

Apparently canonicalisation isn't the solution either. I found this interesting comment elsewhere:
https://mortoray.com/2013/11/27/the-string-type-is-broken...

The essential bit: “Unfortunately, the standard normalisation forms are buggy, and under the current stability policy, cannot be fixed. One example of this that I know is U+387 GREEK ANO TELEIA, which wrongly decomposes canonically (!) into U+00B7 MIDDLE DOT (the Greek name even means literally “upper dot”). This means that some processes may choose to avoid normalisation, because, even the canonical forms risk losing important information.”

Fedora and Python 2

ceplm — Fri, 13 Apr 2018 10:30:51 +0000

My point with Joel's article was that, in my experience, a large part of the people complaining about Python 3 encoding/decoding hell are those who still believe that "string is a bunch of bytes" is enough, because they live in the bubble of languages where it is enough (i.e., English and Western European languages). I recently converted M2Crypto to be py2k/py3k-straddling and I had no problems with Unicode encoding/decoding. What I had problems with, and plenty of them, was that the messy py2k str/unicode/bytes situation completely obscured what was really going on. Instead of blaming py3k, I keep blaming py2k and those programmers with their cuckoo-land delusion that "a character is one byte".

And yes, I agree that the implementation in py3k is not perfect; conversion from on-wire eight-bit-per-character data to proper str is sometimes problematic. But a lot of work has been spent on it already and the situation is not that bleak; I would hardly call it hell.

Certainly, compared to the disaster py2k was, py3k is a huge improvement.

Fedora and Python 2

Cyberax — Fri, 13 Apr 2018 06:47:41 +0000

Yes, that's exactly what Perl 6 did. It encodes the text into grapheme clusters, the stuff that people think about as characters. They can be directly indexed, used in splits and so on. As far as I know, that's the only mainstream(-ish) language that does this.

I personally wouldn't have been so opposed to Py3 if it were to do the same. Unicode is a hard problem and full support of it might require compromises.

Fedora and Python 2

peniblec — Fri, 13 Apr 2018 06:23:37 +0000

If normalization is so obviously needed before dealing with Unicode strings, wouldn’t it make sense for languages to take care of it by default?

For example, a language’s string-comparison function could automatically make normalized copies of its operands and compare these; users who actually want to compare codepoints could use something like list(s1.codepoints()) == list(s2.codepoints()).

(Not sure what iteration should produce by default, though. Grapheme clusters?)

Maybe performance would take such a hit that it makes sense to let the user ask for normalization explicitly.

Disclaimer: I don’t actually know any language which deals with Unicode strings this way; then again, I don’t actually know many languages.
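
In Python terms, the proposal might look like the following sketch. `NormalizedText` and its `codepoints()` method are hypothetical; nothing like this exists in the standard library. Comparison and hashing use a normalized copy, while the original code points remain available for anyone who wants them.

```python
import unicodedata


class NormalizedText:
    """Hypothetical string wrapper that compares by NFC-normalized form."""

    def __init__(self, value: str):
        self.value = value  # original code points, untouched
        self._key = unicodedata.normalize("NFC", value)  # used for comparison

    def __eq__(self, other):
        if isinstance(other, NormalizedText):
            return self._key == other._key
        return NotImplemented

    def __hash__(self):
        # Must agree with __eq__, so hash the normalized form as well.
        return hash(self._key)

    def codepoints(self):
        """Explicit escape hatch: the raw code points of the original."""
        return [ord(ch) for ch in self.value]


s1 = NormalizedText("\u00e9")    # precomposed e-acute
s2 = NormalizedText("e\u0301")   # 'e' plus combining acute accent
```

With this, `s1 == s2` holds while `s1.codepoints() == s2.codepoints()` does not. The cost is one normalization pass per string, which is the performance trade-off mentioned above.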

Fedora and Python 2

peniblec — Fri, 13 Apr 2018 06:05:41 +0000

I may not have thought enough about this, but couldn't this text
editor normalize tokens only for some operations (e.g.
character-count, searching) and otherwise preserve the file's content,
only effectively changing the parts the user actually edited?

Fedora and Python 2

smurf — Fri, 13 Apr 2018 01:38:57 +0000

> the language’s naive handling of these meat-space characters adds hoops to jump through when dealing with those too.

You need to normalize Unicode before doing meaningful things with it. That's a given in any programming language.

You might find fault with the people who invented Unicode. Blaming your (non-)choice of programming language isn't going to help, except that I can think of lots of ways to make it worse. Just look at Java.

Fedora and Python 2

dvdeug — Thu, 12 Apr 2018 23:47:45 +0000

I'm not feeling it here. Having a text file in multiple encodings is incredibly fragile and a pain to work with. If you have to deal with external, non-ASCII symbols in your program, you're going to want to change the names to something you can work with in Python, not a random set of bits. If you're automatically generating code and don't care about the Python symbol names, then encoding them using base64 is trivial (and again, you don't care about the Python symbol names so why do you care if they're line noise or encoded line noise?)

If you're just passing something from system A to system B, you shouldn't have to change the data. But there's a fairly thin region where you can choose to not unmangle something and still expect to be able to do anything with it. Stuff not being clean, nice and tidy is all the more reason to make sure you know exactly how the data you're handling is formatted.

Fedora and Python 2

dvdeug — Thu, 12 Apr 2018 23:18:23 +0000

There's people at Unicode still pissed at UTF-8. We could have had one standard line end marker, one paragraph marker, proper dashes and quotes, but instead we got UTF-8 and forced Unicode to be a fancier ASCII. UTF-1 was sort of ASCII compatible back at the start of Unicode, but it was ... horrifying. Doing "mod 190" to encode characters, anyone?

The other argument is that nobody could have sold a 32 bit encoding in the early 1990s. In 1996, they declared that it was going to have to expand from 16 bit to 32 bit (or 20.1 bit). In 2001, Deseret was one of the first scripts encoded above FFFF because they needed to start encoding stuff up there, but they didn't want to start with scripts people were going to fight to keep in the BMP. And yet it wasn't until 2010 that MySQL, even in UTF-8, supported characters above FFFF. Unless they've made some changes since I checked last time, it still bites people that MySQL charset utf8 is for FFFF and below only, and utf8mb4 is needed to actually encode UTF-8.
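
The FFFF boundary is easy to see from Python. The sketch below uses made-up helper names and does not talk to MySQL at all; it only shows why a three-bytes-per-character UTF-8 subset cannot hold a Deseret letter.

```python
def utf8_width(ch: str) -> int:
    """Number of bytes one character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


def fits_in_bmp(text: str) -> bool:
    """True if every character is at or below U+FFFF (three UTF-8 bytes at most)."""
    return all(ord(ch) <= 0xFFFF for ch in text)


deseret_long_i = "\U00010400"  # DESERET CAPITAL LETTER LONG I, outside the BMP

widths = [utf8_width(ch) for ch in ("A", "\u00e9", "\u20ac", deseret_long_i)]
```

The widths come out as 1, 2, 3 and 4 bytes: anything limited to three bytes per character stops at FFFF.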

With a bunch of foresight on everyone's part, it might have been better. But pushing a 32 bit encoding in 1990 could also have mired the idea and left us working with an ISO 2022-style pile of encodings or at least stalled things by a decade where more legacy data in legacy encodings, and even more legacy encodings, were created, and more protocols were designed around the idea that everything has its own encoding instead of everything being in a fixed encoding, or at least a Unicode-compatible encoding.

Fedora and Python 2

HelloWorld — Thu, 12 Apr 2018 23:07:26 +0000

It's interesting how many languages get this wrong. For instance, Java doesn't even give you code points, much less grapheme clusters. Instead, it gives you 16-bit “char” values (“code units” in Unicode-speak) and then you have some methods like codePointAt that give you the code point at some (char-based) position in the string. And when you want to iterate over the code points in a string, I don't know how you're supposed to get from one index to the next, i.e. whether you need to increase the index by one or by two. It might be that you need to compare against Character.MAX_LOW_SURROGATE, but I'm not sure… Needless to say it doesn't help you dealing with grapheme clusters at all, apparently you're supposed to use third-party libraries like icu4j. All in all, it's a clusterfuck (ba-dum tss!)
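
The "by one or by two" question comes down to surrogate pairs. As an illustration in Python rather than Java (the helper names are invented for this sketch), the following walks a list of 16-bit code units and advances two units whenever it sees a high surrogate followed by a low surrogate.

```python
import struct


def utf16_code_units(text: str):
    """The 16-bit code units a UTF-16 string type would store for text."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


def iter_code_points(units):
    """Yield code points from UTF-16 code units, combining surrogate pairs."""
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            low = units[i + 1]
            yield 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
            i += 2  # a surrogate pair occupies two code units
        else:
            yield unit
            i += 1  # everything in the BMP occupies one


units = utf16_code_units("a\U00010400b")
points = list(iter_code_points(units))
```

Three code points end up stored as four code units, which is why char-based indexes and code-point indexes drift apart.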

Fedora and Python 2

dvdeug — Thu, 12 Apr 2018 22:45:21 +0000

There's certainly an argument for normalization, but every person annoyed by Python 3 would likely be more pissed off if, by default, it silently changed text when reading it in. Imagine a text editor where you opened your new novel, "Nous étions à l’étude, quand le Proviseur entra, ..." and changed it to "Nous étudiions, quand le Directeur entra, ..." and fed it back to git to discover a diff that changed every single line in the file.

Fedora and Python 2

togga — Thu, 12 Apr 2018 21:59:14 +0000

I haven't got a clue what you're talking about but given python's dynamic typing this line was quite amusing :-)

> "Explicit is better than implicit" is one of Python's mottos. I happen to think that it's helpful. If you don't, well, there are other languages.

Fedora and Python 2

peniblec — Thu, 12 Apr 2018 20:29:38 +0000

Correct me if I’m wrong, but Joel’s point in this article is that:

> It does not make sense to have a string without knowing what encoding it uses. […] If you have a string, in memory, in a file, or in an email message, you have to know what encoding it is in or you cannot interpret it or display it to users correctly.

To paraphrase, if you have to display any kind of text to a human user, you (or your programming environment) must explicitly know what encoding to use to translate the byte streams you carry around into intelligible characters.

Now AFAIU, when people complain about the “encode/decode string-hell” they are not really disputing this. From what I gather, these people deplore that by default, various parts of Python 3’s standard library expect their inputs to be Unicode characters, in contexts where there is no reason for them to be.

Personally, while I enjoy Python 3 overall, I agree that the decision to have streams default to meat-world characters rather than bytes is debatable. Not every program has to deal with human-readable strings.

Let’s say though that we all collectively agreed that Python having a bias toward human text is a good thing: let’s assume that dealing with byte-streams that do not map to Unicode characters is so rare that having to sprinkle a few bs and .buffers here and there is not a deal-breaker.

Even then, Python’s approach to human text feels somewhat naive: lengths, indexing, iteration and comparison are all based on code points, which AFAIU do not really represent anything meaningful in meat-space.

For example, Python 3 thinks that 'é' != 'é' because one is 'e'+'\N{COMBINING ACUTE ACCENT}' and the other is '\N{LATIN SMALL LETTER E WITH ACUTE}'. My French AZERTY keyboard makes typing the latter straightforward; I understand that GTK applications make it easy to type the former with “e Control-Shift-U 301”.

I can’t think of a program geared toward human interaction that should consider these two strings different. Python does offer unicodedata.normalize() to solve this specific problem; must we rely on every text-handling Python program out there to make its input go through this function? Arguably, shouldn’t the language abstract these minutiae away from us?
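
For reference, the behaviour in question can be reproduced as follows; the variable names are arbitrary.

```python
import unicodedata

composed = "\N{LATIN SMALL LETTER E WITH ACUTE}"
decomposed = "e\N{COMBINING ACUTE ACCENT}"

# Comparison and length both work on code points, not on what a reader sees.
same_as_is = composed == decomposed
lengths = (len(composed), len(decomposed))

# Explicit normalization makes the two spellings compare equal.
same_nfc = unicodedata.normalize("NFC", decomposed) == composed
same_nfd = unicodedata.normalize("NFD", composed) == decomposed
```

Without normalization the strings differ and have lengths 1 and 2; after normalizing to either form they match.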

tl;dr: While Joel’s article is a classic and a must-read, I’m not sure it addresses the problems raised by Python 3’s critics:

1. the language’s preference toward meat-space characters adds hoops to jump through when dealing with genuine byte-streams;
2. the language’s naive handling of these meat-space characters adds hoops to jump through when dealing with those too.

Fedora and Python 2

togga — Thu, 12 Apr 2018 19:41:34 +0000

I get it. When peeling onions in a submarine, all encode/decode issues don't feel like hell anymore. Your referenced article made the same progress as the onions regarding the Py3 design issues.

Fedora and Python 2

smurf — Thu, 12 Apr 2018 17:52:10 +0000

You can't both expect programs to work with whatever random cruft you feed them, *and* to keep your data safe.

Setting the encoding to whatever is actually used is simple enough – besides, that stuff happens to work correctly when your data and your locale match. Surprise: they usually do. And if you want to process binary data, then use "sys.stdin/out.buffer" (or binary mode). This is documented.
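
The difference between the text layer and the `.buffer` underneath can be shown without touching the real stdin. In this sketch an `io.TextIOWrapper` around an `io.BytesIO` stands in for `sys.stdin`; the helper names are made up for the example.

```python
import io


def read_as_text(data: bytes, encoding: str = "utf-8") -> str:
    """Read through the text layer, as sys.stdin.read() would."""
    stream = io.TextIOWrapper(io.BytesIO(data), encoding=encoding)
    return stream.read()


def read_as_bytes(data: bytes) -> bytes:
    """Bypass decoding via .buffer, as sys.stdin.buffer.read() would."""
    stream = io.TextIOWrapper(io.BytesIO(data), encoding="utf-8")
    return stream.buffer.read()


decoded = read_as_text(b"caf\xc3\xa9")  # data matches the encoding

try:
    read_as_text(b"\xff")  # not valid UTF-8
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

untouched = read_as_bytes(b"\xff")
```

The text layer decodes matching data and rejects the stray byte, while the binary layer hands it back unchanged.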

On the other hand, allowing a random mix of differently-encoded strings (which is what Python2 or Perl do) and then trying to disentangle the resulting mojibake (or even figure out what causes it) is a frustrating and sometimes futile exercise in preventing data loss after it's too late. Been there, done that, bitten the carpet.

"Explicit is better than implicit" is one of Python's mottos. I happen to think that it's helpful. If you don't, well, there are other languages.

Fedora and Python 2

togga — Thu, 12 Apr 2018 17:05:20 +0000

Thanks for the heads up in the new world of Python. I figure I should expect the user to have set this PYTHONIOENCODING variable to "random" to begin with.

Scripts should then always start by setting this parameter, or is it too late? Are we talking shell wrappers here or refuse to start if set incorrectly?
If we do multiple things with multiple needs for encoding, do we need different settings for different incoming data, in other words set it with each read?

Fedora and Python 2

togga — Thu, 12 Apr 2018 16:54:31 +0000

"What, exactly, do you want python to do with random bytes?"

Python2 just keeps them as is, works wonderfully.

"Use sys.stdin.buffer to get bytes rather than UTF-8."

Thanks. It worked. I made a compatible version. Awesome.

$ echo -n -e "\xFF" | python3 -c "import sys; S=type('S', (bytes,), {'__repr__': lambda s: bytes.__repr__(s)[1:]}); read_stdin=sys.stdin.buffer.read; sys.stdin.read = lambda: S(read_stdin()); print(repr(sys.stdin.read()))"
'\xff'

Fedora and Python 2

togga — Thu, 12 Apr 2018 16:36:45 +0000

> "Or you're a bit confused."

I'm not confused, I've just experienced lots of issues I didn't have before Python3's software castle in the air.

> "the easy answer is to just use python strings"

Isn't this kind of bloated? These strings can come from anywhere and might not even be visible in Python code at all.

> "Decode the symbol name to a string with errors='surrogateescape' for use in python, and use the same error handler to decode back to the original bytes for getting the symbol out of the library."
> "Or you could add a layer of indirection between the structure field names and the function symbols."

You mean use Python3 and stick with tons of workarounds and issues just for its sake? Change the whole world to Python3? I value my time much more than that.

Fedora and Python 2

togga — Thu, 12 Apr 2018 16:29:08 +0000

I use Python as a script language in the sense of a glue language that adapts to the world; you seem to have the ambition to change the world to adapt to Python. Python and UTF8 is not THAT good :-) For me Python is rather quite old, bloated and tired, and I don't even want to get started with UTF8 as some sort of universal data representation...

The latter sounds to me like utopia and an endless job for achieving nothing but explains lots of the attitudes of the Python community.

Fedora and Python 2

ceplm — Thu, 12 Apr 2018 14:26:59 +0000

> What if you need some Python 3 features but without the Python3 encode/decode string-hell?

There is no encode/decode hell, there are only programmers who should peel onions in a submarine (https://wp.me/p83KNI-eH).

Fedora and Python 2

mjblenner — Thu, 12 Apr 2018 10:24:45 +0000

> And remember to never, ever mix them in the same program. Oh, and it'll mostly work if you forget .buffer in one place. It'll just crash with bad data sometimes.

Never ever mix them because if you do it will mostly work? (OK...)

Anyway, sounds like you need

PYTHONIOENCODING="utf-8:surrogateescape"

or use open(0, 'rb') or something, depending on what you're trying to do.
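
What the surrogateescape error handler does can be seen in a short sketch; the byte string here is an arbitrary example. Undecodable bytes are smuggled through as lone surrogates and come back out byte-for-byte when the text is encoded with the same handler.

```python
raw = b"sym\xff\xfe"  # ASCII followed by two bytes that are not valid UTF-8

# Decoding never fails: each bad byte becomes a lone surrogate (U+DC80..U+DCFF).
text = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
round_tripped = text.encode("utf-8", errors="surrogateescape")

# A strict encode refuses the smuggled surrogates.
try:
    text.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
```

Setting PYTHONIOENCODING as shown above applies the same handler to the standard streams, so the str objects read from stdin can be written back out without losing bytes.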

Fedora and Python 2

Cyberax — Thu, 12 Apr 2018 08:15:58 +0000

> Use sys.stdin.buffer to get bytes rather than UTF-8.
And remember to never, ever mix them in the same program. Oh, and it'll mostly work if you forget .buffer in one place. It'll just crash with bad data sometimes.

Fedora and Python 2

niner — Thu, 12 Apr 2018 08:11:04 +0000

To get a feeling for activity in Perl 6, I can heartily recommend https://p6weekly.wordpress.com/

Fedora and Python 2

njs — Thu, 12 Apr 2018 05:44:18 +0000

Whoops, this was already addressed; I just misread the threading.